Data Curation like Paint Preparation

Derek Lowe's blog "In the Pipeline", on pharmaceutical drug discovery, is often insightful. A recent post, "AI, Machine Learning and the Pandemic", comments:

... The biggest point to remember, when talking about AI/ML and drug discovery, is that these techniques will not help you if you have a big problem with insufficient information. They don't make something from nothing. Instead, they sort through huge piles of Somethings in ways that you don't have the resources or patience to do yourself. That means (first) that you must be very careful about what you feed these computational techniques at the start, because "garbage in, garbage out" has never been more true than it is with machine learning. Indeed, data curation is a big part of every successful ML effort, for much the same reason that surface preparation is a big part of every successful paint job.

And second, it means that there is a limit on what you can squeeze out of the information you have. What if you've curated everything carefully, and the pile of reliable data still isn't big enough? That's our constant problem in drug research. There are just a lot of things that we don't know, and sometimes we are destined to find out about them very painfully and expensively. Look at that oft-quoted 90% failure rate across clinical trials: is that happening because people are lazy and stupid and enjoy shoveling cash into piles and lighting it on fire? Not quite: it's generally because we keep running into things that we didn't know about. Whoops, turns out Protein XYZ is not as important as we thought in Disease ABC – the patients don't really get much better. Or whoops, turns out that drugs that target the Protein XYZ pathway also target other things that we had never seen before and that cause toxic effects, and the patients actually get worse. No one would stumble into things like that on purpose. Sometimes, in hindsight, we can see how such things might have been avoided, but often enough it's just One of Those Things, and we add a bit more knowledge to the pile, at great expense. ...
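What Lowe's "surface preparation" can look like in practice: a toy sketch of the scrubbing a raw assay table typically needs before any learning happens. The compound IDs, units, and thresholds below are invented purely for illustration:

```python
import pandas as pd

# Hypothetical raw assay table: duplicate rows, inconsistent identifiers,
# mixed units, a missing value, and a physically impossible potency.
raw = pd.DataFrame({
    "compound": ["CHEMBL25", "CHEMBL25", "chembl1201", None, "CHEMBL192"],
    "ic50":     [120.0, 120.0, 0.35, 44.0, -7.0],
    "unit":     ["nM", "nM", "uM", "nM", "nM"],
})

curated = (
    raw
    .dropna(subset=["compound"])                  # no identifier, no use
    .assign(compound=lambda d: d["compound"].str.upper())
    .drop_duplicates(subset=["compound", "ic50", "unit"])
    .assign(ic50_nm=lambda d:                     # normalize to nanomolar
            d["ic50"] * d["unit"].map({"nM": 1.0, "uM": 1000.0}))
    .query("ic50_nm > 0")                         # negative potency is garbage
    .loc[:, ["compound", "ic50_nm"]]
)
print(curated)
```

None of those steps is glamorous, but each one removes a way for the downstream model to learn the wrong thing, and together they usually take far longer than the model fit itself.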

And last year, in "Machine-Mining the Literature", Lowe wrote:

... In the same way as the old line about how armchair military buffs talk strategy and tactics while professionals talk logistics, professionals in this field tend to devote a lot of time to data curation and preparation. That's partly because the real-world data we would like to use are often in rather shaggy piles, and also because even the best machine-learning techniques tend to be a bit finicky and brittle compared to what you'd actually want. We're used to that with internal combustion engines: diesel fuel, ethanol, gasoline, and jet fuel are not perfectly interchangeable in most situations, and so it is with engines of knowledge. They are tuned up for specific types of input, and will stall if fed something else. To use a different analogy, data curation is very much akin to the advice that you should spend more time preparing a surface for a good paint job than you do in applying the actual paint. In almost every case, you will definitely spend more time getting your data in shape for machine learning than the actual computations will take. ...
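The "stall if fed something else" failure mode suggests why careful pipelines validate their inputs before the model ever sees them. A minimal sketch, with a schema that is entirely hypothetical:

```python
# Refuse to run on inputs the model wasn't tuned for, instead of
# silently producing nonsense. The expected schema is hypothetical.
EXPECTED = {"compound": str, "ic50_nm": float}

def validate_row(row: dict) -> None:
    for field, kind in EXPECTED.items():
        if field not in row:
            raise ValueError(f"missing field: {field}")
        if not isinstance(row[field], kind):
            raise TypeError(f"{field}: expected {kind.__name__}, "
                            f"got {type(row[field]).__name__}")
    if row["ic50_nm"] <= 0:
        raise ValueError("ic50_nm must be positive")

validate_row({"compound": "CHEMBL25", "ic50_nm": 120.0})    # passes
# validate_row({"compound": "CHEMBL25", "ic50_nm": "120"})  # raises TypeError
```

Failing loudly at the intake valve is cheaper than debugging an engine that quietly burned the wrong fuel.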

And in conclusion, Lowe notes:

... Extending this to the biomedical literature will be quite an effort – many will recall that this is just what one aspect of "Watson For Drug Discovery" was supposed to do (root through PubMed for new correlations). As I mentioned in that linked post, though, the failure of Watson (and some other well-hyped approaches, some of which are in the process of failing now, I believe) does not mean that the whole idea is a bust. It just means that it's hard. And that people who are promising you that they've solved it and that you can get in on the ground floor if you'll just pull out your wallet should be handled with caution. The paper today gives us a hint of what could be possible, eventually, after a lot of time, a lot of money, and a lot of (human) brainpower. Bring it on! ...

Great wisdom: "It's Hard!"

(cf Awesomely Simple (2001-01-26), Simplicity via Abstraction (2016-01-07), Taxonomy of Machine Learning (2017-02-02), ...) - ^z - 2020-07-30